Communication Across Fault-Containment Firewalls on the SGI Origin
Abstract
Scalability and reliability are inseparable in high-performance computing. Fault isolation through hardware is a popular means of providing reliability. Unfortunately, such isolation also increases communication latencies: typically, one has to drop into and out of the kernel to communicate between failure domains. On the other hand, relaxing fault-isolation domains allows efficient communication, but at the risk of failure propagation, and thus reduced reliability. We are concerned with finding a middle ground between these extremes. We first review a few salient aspects of the SGI Origin2000 architecture, mentioning the hardware features germane to efficient communication and to building protection firewalls. Then, we describe a mechanism for risk-free, point-to-point communication between processes on distinct failure domains. Quoting performance numbers, we show that the overheads of crossing domains render this mechanism unattractive for small messages. To address this issue, we describe a mechanism for controlled opening of the firewalls, thereby achieving explicit inter-partition shared memory for communication. We describe the kernel software that addresses the resulting reliability issues, and discuss how familiar IPC mechanisms such as MPI and SysV shared memory can use the explicit shared memory to advantage. Finally, based on the lessons learnt, we discuss some future directions and draw concluding remarks.
Similar resources
The performance and scalability of SHMEM and MPI-2 one-sided routines on a SGI Origin 2000 and a Cray T3E-600
This paper compares the performance and scalability of SHMEM and MPI-2 one-sided routines on different communication patterns for a SGI Origin 2000 and a Cray T3E-600. The communication tests were chosen to represent commonly used communication patterns with low contention (accessing distant messages, a circular right shift, a binary tree broadcast) to communication patterns with high contentio...
Comparing the Memory System Performance of DSS Workloads on the HP V-Class and SGI Origin 2000
In this paper, we present an in-depth analysis of the memory system performance of the DSS commercial workloads on two state-of-the-art multiprocessors: the SGI Origin 2000 and the HP V-Class. Our results show that a single query process takes almost the same amount of cycles in both machines. However, when multiple query processes run simultaneously on the system, the execution time tends to i...
Failure Resilient Heterogeneous Parallel Computing Across Multidomain Clusters
We propose lightweight middleware solutions that facilitate and simplify the execution of failure-resilient MPI programs across multidomain clusters. The system described in this paper leverages H2O, a distributed metacomputing framework, to route MPI message passing across heterogeneous aggregates located in different administrative or network domains. MPI programs instantiate a specially writ...
Cooperative Containment of Fast Scanning Worms
Scanning worms, that spread by probing the IP address space to find vulnerable hosts, are among the most serious threats to Internet security today, as evident by the time-scales of some recent large-scale worm attacks. Only an automatic defense can hope to contain a carefully designed worm that uses an unknown or a recently-divulged vulnerability. In this paper, we propose a cooperation-based ...
Analyzing Cooperative Containment of Fast Scanning Worms
Fast scanning worms, that can infect nearly the entire vulnerable population in order of minutes, are among the most serious threats to the Internet today. In this work, we investigate the efficacy of cooperation among Internet firewalls in containing such worms. We first propose a model for firewall-level cooperation and then study the containment in our model of cooperation using analysis and...